
[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine) #22051

Merged
Fridge003 merged 1 commit into sgl-project:main from froststeam:qzg/musa-fa-fix on Apr 10, 2026

Conversation

@froststeam (Contributor) commented Apr 3, 2026

Motivation

This PR re-introduces the Flash Attention backend support that was merged in PR #17985 and later reverted in PR #22002 due to a bug: the original commit 2373552 caused CI failures (see the failed CI job).

Previously, the MUSA-adapted flash attention implementation had a bug in the _forward_extend_impl method: it lacked a mechanism to select the kernel implementation based on the fa_impl_ver parameter, so it always used the default FA3 implementation regardless of the specified version.

Fix Applied

After rebasing onto the latest main branch, the kernel selection logic has been refactored and moved into the FlashAttentionBackend.__init__ method, ensuring that the appropriate flash attention implementation is selected during initialization based on the fa_impl_ver parameter.

  1. Moved kernel selection to __init__: The logic to select the correct flash attention kernel (including MUSA-specific implementations) is now handled in the FlashAttentionBackend.__init__ method, where two instance variables are initialized:

    • self.flash_attn_with_kvcache: For cached attention operations
    • self.flash_attn_varlen_func: For variable-length attention operations
  2. Updated forward methods: Both _forward_extend_impl and _forward_decode_impl now use these instance variables instead of directly calling the default implementations, ensuring the correct kernel is used based on the initialized configuration (a minimal sketch of this pattern follows the list).
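For illustration, here is a minimal, self-contained sketch of the dispatch pattern described above. All names below (the stand-in kernel functions and the is_musa flag) are illustrative placeholders, not the actual sglang signatures:

```python
# Hypothetical sketch: select the flash attention kernel once in __init__
# and dispatch through instance attributes afterwards.

def fa3_varlen(q, k, v):
    # Stand-in for the default FA3 varlen kernel.
    return "fa3"

def mate_varlen(q, k, v):
    # Stand-in for the MUSA/MATE varlen kernel.
    return "mate"

class FlashAttentionBackend:
    def __init__(self, fa_impl_ver: int = 3, is_musa: bool = False):
        # Kernel selection happens exactly once, at initialization time,
        # based on the platform and the requested fa_impl_ver.
        if is_musa:
            self.flash_attn_varlen_func = mate_varlen
        elif fa_impl_ver == 3:
            self.flash_attn_varlen_func = fa3_varlen
        else:
            raise NotImplementedError(f"unsupported fa_impl_ver={fa_impl_ver}")

    def _forward_extend_impl(self, q, k, v):
        # The original bug: calling the FA3 kernel directly here ignored
        # fa_impl_ver. Dispatching through the instance attribute respects
        # whatever __init__ selected.
        return self.flash_attn_varlen_func(q, k, v)

backend = FlashAttentionBackend(is_musa=True)
assert backend._forward_extend_impl(None, None, None) == "mate"
```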

Accuracy Tests

root@324a10004:/sgl-workspace/sglang# python3 -m sglang.launch_server \
            --model-path /mnt/seed17/001688/models/Qwen2.5-7B-Instruct/ \
            --served-model-name base-model \
            --trust-remote-code \
            --mem-fraction-static 0.80 \
            --cuda-graph-bs $(seq 1 2) \
            --host 0.0.0.0 \
            --port 30000 \
            --attention-backend fa3 \
            --tp-size 2 \
            --pp-size 2 \
            --disable-radix-cache \
            --chunked-prefill-size -1
2026-04-09 20:05:05 | warnings | 140537684047680 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/launch_server.py:51: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint.
  Example: sglang serve --model-path <model> [options]
  warnings.warn(

2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : Platform plugin musa is activated
2026-04-09 20:05:06 | __init__ | 140537684047680 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:06 | _custom_ops | 140537684047680 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:06 | warnings | 140537684047680 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:06 | warnings | 140537684047680 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

2026-04-09 20:05:07 | server_args | 140537684047680 | WARNING : Pipeline parallelism is incompatible with overlap schedule.
[2026-04-09 20:05:07] server_args=ServerArgs(model_path='/mnt/seed17/001688/models/Qwen2.5-7B-Instruct/', tokenizer_path='/mnt/seed17/001688/models/Qwen2.5-7B-Instruct/', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_keyfile_password=None, enable_ssl_refresh=False, enable_http2=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.8, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=-1, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, disable_priority_preemption=False, default_priority_value=None, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=64, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='musa', tp_size=2, pp_size=2, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_response_default_include_usage=False, incremental_streaming_output=False, enable_streaming_session=False, random_seed=400815404, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, use_ray=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_mfu_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='base-model', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, 
load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, experts_shared_outer_loras=None, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='pytorch', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_dflash_block_size=None, speculative_dflash_draft_window_size=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_max_trie_depth=18, speculative_ngram_capacity=10000000, speculative_ngram_external_corpus_path=None, speculative_ngram_external_sam_budget=0, speculative_ngram_external_corpus_max_tokens=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, enforce_disable_flashinfer_allreduce_fusion=False, enable_aiter_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, enable_elastic_expert_backup=False, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, linear_attn_backend='triton', linear_attn_decode_backend=None, linear_attn_prefill_backend=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_hisparse=False, hisparse_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, 
offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=True, cuda_graph_max_bs=2, cuda_graph_bs=[1, 2], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, pre_warm_nccl=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, disable_piecewise_cuda_graph=True, enforce_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=-1, piecewise_cuda_graph_tokens=[], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, gc_threshold=None, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_fused_moe_sum_all_reduce=False, enable_prefill_context_parallel=False, prefill_cp_mode='in-seq-split', enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], enable_adaptive_dispatch_to_encoder=False, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, engine_info_bootstrap_port=6789, modelexpress_config=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, enable_mm_global_cache=False, decrypted_config_file=None, 
decrypted_draft_config_file=None, forward_hooks=None)
[2026-04-09 20:05:08] Using default HuggingFace chat template with detected content format: string
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140102443353920 | INFO : Platform plugin musa is activated
2026-04-09 20:05:14 | __init__ | 140314952234816 | INFO : Platform plugin musa is activated
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 139839021729600 | INFO : Platform plugin musa is activated
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140605230712640 | INFO : Platform plugin musa is activated
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : Available plugins for group vllm.platform_plugins:
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : - musa -> vllm_musa:register
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
2026-04-09 20:05:14 | __init__ | 140715677046592 | INFO : Platform plugin musa is activated
2026-04-09 20:05:15 | __init__ | 140102443353920 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:15 | _custom_ops | 140102443353920 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:15 | __init__ | 140314952234816 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:15 | _custom_ops | 140314952234816 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:15 | warnings | 140102443353920 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:15 | warnings | 140314952234816 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:15 | __init__ | 139839021729600 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:15 | _custom_ops | 139839021729600 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:15 | warnings | 139839021729600 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:15 | __init__ | 140605230712640 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:15 | _custom_ops | 140605230712640 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:15 | warnings | 140605230712640 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:15 | __init__ | 140715677046592 | INFO : No platform detected, vLLM is running on UnspecifiedPlatform
2026-04-09 20:05:15 | _custom_ops | 140715677046592 | WARNING : Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
2026-04-09 20:05:15 | warnings | 140715677046592 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/awq.py:87: UserWarning: Only CUDA, HIP and XPU support AWQ currently.
  warnings.warn(f"Only CUDA, HIP and XPU support AWQ currently.")

2026-04-09 20:05:15 | warnings | 140102443353920 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

2026-04-09 20:05:15 | warnings | 140314952234816 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

2026-04-09 20:05:15 | warnings | 139839021729600 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

2026-04-09 20:05:15 | warnings | 140605230712640 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

2026-04-09 20:05:15 | warnings | 140715677046592 | WARNING : /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/layers/quantization/gguf.py:48: UserWarning: Only CUDA and MUSA support GGUF quantization currently.
  warnings.warn(f"Only CUDA and MUSA support GGUF quantization currently.")

[2026-04-09 20:05:15 PP1 TP0] Process 1187068 gpu_id 2 is running on CPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]
[2026-04-09 20:05:15 PP1 TP1] Process 1187069 gpu_id 3 is running on CPUs: [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127]
[2026-04-09 20:05:15 PP0 TP1] Process 1187067 gpu_id 1 is running on CPUs: [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127]
[2026-04-09 20:05:16 PP0 TP0] Process 1187066 gpu_id 0 is running on CPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]
[2026-04-09 20:05:16 PP1 TP1] Init torch distributed begin.
[2026-04-09 20:05:16 PP1 TP0] Init torch distributed begin.
[2026-04-09 20:05:16 PP0 TP1] Init torch distributed begin.
[2026-04-09 20:05:16 PP0 TP0] Init torch distributed begin.
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-04-09 20:05:18 PP1 TP0] sglang is using nccl==2.11.4
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-04-09 20:05:18 PP0 TP0] sglang is using nccl==2.11.4
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-04-09 20:05:20 PP0 TP1] sglang is using nccl==2.11.4
[2026-04-09 20:05:20 PP0 TP0] sglang is using nccl==2.11.4
[2026-04-09 20:05:20 PP0 TP0] Init torch distributed ends. elapsed=3.35 s, mem usage=0.89 GB
[2026-04-09 20:05:20 PP1 TP1] Init torch distributed ends. elapsed=3.69 s, mem usage=0.97 GB
[2026-04-09 20:05:20 PP0 TP1] Init torch distributed ends. elapsed=3.64 s, mem usage=0.97 GB
[2026-04-09 20:05:20 PP1 TP0] Init torch distributed ends. elapsed=3.64 s, mem usage=0.97 GB
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.gemma4_audio: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.gemma4_causal: cannot import name 'Gemma4TextConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.gemma4_mm: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.gemma4_vision: cannot import name 'Gemma4VisionConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.gemma4_audio: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.gemma4_causal: cannot import name 'Gemma4TextConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.gemma4_mm: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.gemma4_vision: cannot import name 'Gemma4VisionConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.gemma4_audio: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.gemma4_causal: cannot import name 'Gemma4TextConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.gemma4_mm: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.gemma4_vision: cannot import name 'Gemma4VisionConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.gemma4_audio: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.gemma4_causal: cannot import name 'Gemma4TextConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.gemma4_mm: cannot import name 'Gemma4AudioConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.gemma4_vision: cannot import name 'Gemma4VisionConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/root/.virtualenvs/sglang-0.5.6/lib/python3.10/site-packages/transformers/__init__.py)
[2026-04-09 20:05:20 PP1 TP1] Ignore import error when loading sglang.srt.models.midashenglm: No module named 'torchaudio'
[2026-04-09 20:05:20 PP0 TP0] Ignore import error when loading sglang.srt.models.midashenglm: No module named 'torchaudio'
[2026-04-09 20:05:20 PP0 TP1] Ignore import error when loading sglang.srt.models.midashenglm: No module named 'torchaudio'
[2026-04-09 20:05:20 PP1 TP0] Ignore import error when loading sglang.srt.models.midashenglm: No module named 'torchaudio'
[2026-04-09 20:05:20 PP1 TP0] Load weight begin. avail mem=78.37 GB
[2026-04-09 20:05:20 PP1 TP1] Load weight begin. avail mem=78.37 GB
[2026-04-09 20:05:20 PP0 TP0] Load weight begin. avail mem=78.26 GB
[2026-04-09 20:05:20 PP0 TP1] Load weight begin. avail mem=78.37 GB
Multi-thread loading shards:  50% Completed | 2/4 [00:06<00:07,  3.54s/it]
[2026-04-09 20:05:30 PP0 TP1] Parameter lm_head.weight not found in params_dict
[2026-04-09 20:05:30 PP0 TP1] Parameter model.norm.weight not found in params_dict
[2026-04-09 20:05:30 PP0 TP0] Parameter lm_head.weight not found in params_dict
[2026-04-09 20:05:30 PP0 TP0] Parameter model.norm.weight not found in params_dict
Multi-thread loading shards: 100% Completed | 4/4 [00:12<00:00,  3.05s/it]
[2026-04-09 20:05:36 PP0 TP1] Load weight end. elapsed=15.82 s, type=Qwen2ForCausalLM, avail mem=74.52 GB, mem usage=3.86 GB.
[2026-04-09 20:05:36 PP0 TP0] Load weight end. elapsed=15.83 s, type=Qwen2ForCausalLM, avail mem=74.40 GB, mem usage=3.86 GB.
[2026-04-09 20:05:36 PP0 TP0] Using KV cache dtype: torch.bfloat16
[2026-04-09 20:05:37 PP1 TP1] Parameter model.embed_tokens.weight not found in params_dict
[2026-04-09 20:05:37 PP1 TP0] Parameter model.embed_tokens.weight not found in params_dict
[2026-04-09 20:05:37 PP1 TP1] Load weight end. elapsed=16.38 s, type=Qwen2ForCausalLM, avail mem=74.52 GB, mem usage=3.86 GB.
[2026-04-09 20:05:37 PP1 TP0] Load weight end. elapsed=16.38 s, type=Qwen2ForCausalLM, avail mem=74.52 GB, mem usage=3.86 GB.
[2026-04-09 20:05:37 PP1 TP0] Using KV cache dtype: torch.bfloat16
[2026-04-09 20:05:37 PP0 TP0] KV Cache is allocated. #tokens: 4400192, K size: 29.37 GB, V size: 29.37 GB
[2026-04-09 20:05:37 PP1 TP0] KV Cache is allocated. #tokens: 4400192, K size: 29.37 GB, V size: 29.37 GB
[2026-04-09 20:05:37 PP0 TP0] Memory pool end. avail mem=14.99 GB
[2026-04-09 20:05:37 PP1 TP0] Memory pool end. avail mem=15.11 GB
[2026-04-09 20:05:37 PP0 TP1] KV Cache is allocated. #tokens: 4400192, K size: 29.37 GB, V size: 29.37 GB
[2026-04-09 20:05:37 PP1 TP1] KV Cache is allocated. #tokens: 4400192, K size: 29.37 GB, V size: 29.37 GB
[2026-04-09 20:05:37 PP0 TP1] Memory pool end. avail mem=15.11 GB
[2026-04-09 20:05:37 PP1 TP1] Memory pool end. avail mem=15.11 GB
[2026-04-09 20:05:38 PP0 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=15.05 GB
[2026-04-09 20:05:38 PP1 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=15.05 GB
[2026-04-09 20:05:38 PP1 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=15.05 GB
[2026-04-09 20:05:38 PP1 TP0] Capture cuda graph bs [1, 2]
[2026-04-09 20:05:38 PP0 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=14.94 GB
[2026-04-09 20:05:38 PP0 TP0] Capture cuda graph bs [1, 2]
Capturing batches (bs=1 avail_mem=14.32 GB): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:02<00:00, 31.37s/it]
[2026-04-09 20:06:41 PP1 TP0] Registering 56 cuda graph addresses
[2026-04-09 20:06:42 PP1 TP1] Capture cuda graph end. Time elapsed: 63.94 s. mem usage=0.74 GB. avail mem=14.32 GB.
[2026-04-09 20:06:42 PP1 TP1] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-04-09 20:06:42 PP1 TP0] Capture cuda graph end. Time elapsed: 63.95 s. mem usage=0.74 GB. avail mem=14.32 GB.
[2026-04-09 20:06:42 PP1 TP0] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
Capturing batches (bs=1 avail_mem=14.21 GB): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:04<00:00, 32.34s/it]
[2026-04-09 20:06:43 PP0 TP0] Registering 58 cuda graph addresses
[2026-04-09 20:06:44 PP0 TP1] Capture cuda graph end. Time elapsed: 66.49 s. mem usage=0.74 GB. avail mem=14.32 GB.
[2026-04-09 20:06:44 PP0 TP1] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-04-09 20:06:44 PP0 TP0] Capture cuda graph end. Time elapsed: 66.50 s. mem usage=0.74 GB. avail mem=14.20 GB.
[2026-04-09 20:06:44 PP0 TP0] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-04-09 20:06:45 PP0 TP0] max_total_num_tokens=4400192, chunked_prefill_size=-1, max_prefill_tokens=16384, max_running_requests=4096, context_len=32768, available_gpu_mem=14.20 GB
[2026-04-09 20:06:45 PP1 TP0] max_total_num_tokens=4400192, chunked_prefill_size=-1, max_prefill_tokens=16384, max_running_requests=4096, context_len=32768, available_gpu_mem=14.32 GB
[2026-04-09 20:06:45] INFO:     Started server process [1185766]
[2026-04-09 20:06:45] INFO:     Waiting for application startup.
[2026-04-09 20:06:45] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2026-04-09 20:06:45] INFO:     Application startup complete.
[2026-04-09 20:06:45] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2026-04-09 20:06:46] INFO:     127.0.0.1:35942 - "GET /model_info HTTP/1.1" 200 OK
[2026-04-09 20:06:47 PP0 TP1] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1058: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-04-09 20:06:48 PP0 TP0] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1058: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-04-09 20:06:48 PP1 TP1] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1058: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-04-09 20:06:48 PP1 TP0] /mnt/seed17/001688/qzg/qwen3/260124/github/sglang/python/sglang/srt/distributed/parallel_state.py:1058: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/pytorch/torch/csrc/utils/tensor_new.cpp:1581.)
  object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)

[2026-04-09 20:06:48 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00
[2026-04-09 20:06:48 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00
[2026-04-09 20:06:49] INFO:     127.0.0.1:35956 - "POST /generate HTTP/1.1" 200 OK
[2026-04-09 20:06:49] The server is fired up and ready to roll!
[2026-04-09 20:06:59 PP0 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 5.84
[2026-04-09 20:06:59 PP1 TP0] Prefill batch, #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 6.08
[2026-04-09 20:06:59 PP0 TP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.39, #queue-req: 0
[2026-04-09 20:06:59 PP1 TP0] Decode batch, #running-req: 1, #token: 64, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.39, #queue-req: 0
[2026-04-09 20:07:01 PP0 TP0] Decode batch, #running-req: 1, #token: 128, token usage: 0.00, cuda graph: True, gen throughput (token/s): 24.78, #queue-req: 0
[2026-04-09 20:07:01 PP1 TP0] Decode batch, #running-req: 1, #token: 128, token usage: 0.00, cuda graph: True, gen throughput (token/s): 24.77, #queue-req: 0
[2026-04-09 20:07:01 PP0 TP0] Decode batch, #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 59.19, #queue-req: 0
[2026-04-09 20:07:01 PP1 TP0] Decode batch, #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 59.23, #queue-req: 0
[2026-04-09 20:07:02 PP0 TP0] Decode batch, #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.99, #queue-req: 0
[2026-04-09 20:07:02 PP1 TP0] Decode batch, #running-req: 1, #token: 192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 59.03, #queue-req: 0
[2026-04-09 20:07:03 PP0 TP0] Decode batch, #running-req: 1, #token: 256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.76, #queue-req: 0
[2026-04-09 20:07:03 PP1 TP0] Decode batch, #running-req: 1, #token: 256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.76, #queue-req: 0
[2026-04-09 20:07:04 PP0 TP0] Decode batch, #running-req: 1, #token: 256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.73, #queue-req: 0
[2026-04-09 20:07:04 PP1 TP0] Decode batch, #running-req: 1, #token: 256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.73, #queue-req: 0
[2026-04-09 20:07:04 PP0 TP0] Decode batch, #running-req: 1, #token: 320, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.69, #queue-req: 0
[2026-04-09 20:07:04 PP1 TP0] Decode batch, #running-req: 1, #token: 320, token usage: 0.00, cuda graph: True, gen throughput (token/s): 58.69, #queue-req: 0
[2026-04-09 20:07:05] INFO:     127.0.0.1:56346 - "POST /generate HTTP/1.1" 200 OK

python3 /sglang/python/sglang/test/few_shot_gsm8k.py
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:54<00:00,  1.75it/s]
Accuracy: 0.870
Invalid: 0.000
Latency: 114.105 s
Output throughput: 322.844 token/s

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See https://github.com/sgl-project/sglang/blob/main/.github/MAINTAINER.md#pull-request-merge-process.
  2. Get approvals from the code owners (https://github.com/sgl-project/sglang/blob/main/.github/CODEOWNERS) and other reviewers.
  3. Trigger CI tests (see https://docs.sglang.io/developer_guide/contribution_guide.html#how-to-trigger-ci-tests) or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Related Links:

@github-actions bot added the dependencies label (Pull requests that update a dependency file) on Apr 3, 2026
@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the MUSA (Moore Threads GPU) hardware backend, specifically focusing on Flash Attention integration. It adds necessary dependencies, configuration parameters, and a new MUSA-specific attention module that wraps the mate library's flash attention functions. The implementation uses a thread-local context manager to automatically inject scheduler metadata into attention calls. Key changes include updates to the attention registry, the FlashAttentionBackend to handle MUSA-specific logic, and server argument adjustments for MUSA compatibility. Feedback highlights potential issues with global buffer safety in multi-GPU environments, metadata cache collisions due to non-unique keys, and the implications of ignoring cu_seqlens_k_new in the MUSA implementation.
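For context, a minimal, self-contained sketch of the thread-local injection pattern the review refers to is shown below; the names here (forward_metadata, flash_attn_call) are hypothetical, not the PR's actual API:

```python
import threading
from contextlib import contextmanager

_tls = threading.local()

@contextmanager
def forward_metadata(metadata):
    # Expose scheduler metadata to attention calls on the current thread only,
    # restoring any previously active metadata on exit.
    prev = getattr(_tls, "metadata", None)
    _tls.metadata = metadata
    try:
        yield
    finally:
        _tls.metadata = prev

def flash_attn_call(q, k, v):
    # The attention wrapper reads the injected metadata instead of requiring
    # every call site to pass it explicitly.
    md = getattr(_tls, "metadata", None)
    return {"out": None, "metadata": md}

with forward_metadata({"cu_seqlens_q": [0, 4, 9]}):
    result = flash_attn_call(None, None, None)
    assert result["metadata"]["cu_seqlens_q"] == [0, 4, 9]
```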

@froststeam changed the title from "[MUSA][9/N] Re-introduceFA3 attention backend support through MATE (MUSA AI Tensor Engine)" to "[MUSA][9/N] Re-introduce FA3 attention backend support through MATE" on Apr 3, 2026
@yeahdongcn (Collaborator) left a comment


I think it would be better to split this into two commits: one carrying over changes from the previous PR, and another fixing the regression in selecting FA kernels for different NVIDIA GPU architectures. This should make it easier for the SGLang core team to review.

@yeahdongcn requested a review from Kangyan-Zhou on April 5, 2026 13:20
@froststeam changed the title from "[MUSA][9/N] Re-introduce FA3 attention backend support through MATE" to "[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)" on Apr 6, 2026
@froststeam force-pushed the qzg/musa-fa-fix branch 3 times, most recently from 9cb257c to 0af5fe5, on April 6, 2026 12:58
@yeahdongcn (Collaborator)

/tag-and-rerun-ci

@yeahdongcn (Collaborator)

/rerun-failed-ci

4 similar comments

@froststeam (Contributor, Author)

/rerun-failed-ci

@froststeam (Contributor, Author)

/rerun-failed-ci

@yeahdongcn (Collaborator)

/rerun-failed-ci

3 similar comments

@yeahdongcn (Collaborator)

Hi @Fridge003 and @Kangyan-Zhou, all NVIDIA CI checks have passed. Could you please take a look if we can merge this? Thanks!

@Fridge003 merged commit f7a1740 into sgl-project:main on Apr 10, 2026
293 of 342 checks passed
Fridge003 pushed a commit that referenced this pull request Apr 11, 2026
…ensor Engine) (#22051)

Co-authored-by: zhiguo.qin <zhiguo.qin@mthreads.com>
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
…ensor Engine) (sgl-project#22051)

Co-authored-by: zhiguo.qin <zhiguo.qin@mthreads.com>

Labels

dependencies (Pull requests that update a dependency file), jit-kernel, mthreads, run-ci
